Semi-Supervised Event Extraction with Paraphrase Clusters
Supervised event extraction systems are limited in their accuracy due to the
lack of available training data. We present a method for self-training event
extraction systems by bootstrapping additional training data. This is done by
taking advantage of the occurrence of multiple mentions of the same event
instances across newswire articles from multiple sources. If our system can
make a high-confidence extraction of some mentions in such a cluster, it can
then acquire diverse training examples by adding the other mentions as well.
Our experiments show significant performance improvements on multiple event
extractors over ACE 2005 and TAC-KBP 2015 datasets.
Comment: NAACL 201
University of Southern Indiana's Solar Eclipse Experience
The University of Southern Indiana Eclipse Ballooning team's experience from May 2016 to August 2017 is comprehensively reviewed. Experience gained during four rehearsal balloon flights is covered, including the need to coordinate with a pre-Senior Design class assisting in three of the flights. Challenges encountered were: learning ballooning techniques, reconfiguring the pod stack, adding new hardware, like a grounding rod and 3D-printed standoff, losing tracking visibility due to server crashes at the Borealis hub, and making quick software turnarounds. The students found the networking afforded by the entire experience to be one of the highlights of the project.
Information Extraction from Semi-Structured Websites
Thesis (Ph.D.)--University of Washington, 2021
The World Wide Web contains countless semi-structured websites, which present information via text embedded in rich layout and visual features. These websites can be a source of information for populating knowledge bases if the facts they present can be extracted and transformed into a structured form, a goal that researchers have pursued for over twenty years. A fundamental opportunity and challenge of extracting from these sources is the variety of signals that can be harnessed to learn an extraction model, from textual semantics to layout semantics to page-to-page consistency of formatting. Extraction from semi-structured sources has been explored by researchers from the natural language processing, data mining, and database communities, but most of this work uses only a subset of the signals available, limiting their ability to scale solutions to extract from the large number and variety of such sites on the Web. In this thesis, we address this problem with a line of research that advances the state of semi-structured extraction by taking advantage of existing knowledge bases, as well as using modern machine learning methods to build rich representations of the textual, layout, and visual semantics of webpages. We present a suite of methods that will enable information extraction from semi-structured sources, addressing scenarios that include both closed and open domain information extraction and varying levels of prior knowledge about a subject domain.
PLAtE: A Large-scale Dataset for List Page Web Extraction
Recently, neural models have been leveraged to significantly improve the
performance of information extraction from semi-structured websites. However, a
barrier for continued progress is the small number of datasets large enough to
train these models. In this work, we introduce the PLAtE (Pages of Lists
Attribute Extraction) dataset as a challenging new web extraction task. PLAtE
focuses on shopping data, specifically extractions from product review pages
with multiple items. PLAtE encompasses both the tasks of: (1) finding
product-list segmentation boundaries and (2) extracting attributes for each
product. PLAtE is composed of 53,905 items from 6,810 pages, making it the
first large-scale list page web extraction dataset. We construct PLAtE by
collecting list pages from Common Crawl, then annotating them on Mechanical
Turk. Quantitative and qualitative analyses are performed to demonstrate PLAtE
has high-quality annotations. We establish strong baseline performance on PLAtE
with a SOTA model achieving an F1-score of 0.750 for attribute classification
and 0.915 for segmentation, indicating opportunities for future research
innovations in web extraction.